Educational segregation in NYC

“Achieving inclusive and quality education for all reaffirms the belief that education is one of the most powerful and proven vehicles for sustainable development.” - UN

It should be the goal of any governing organization to ensure high-quality education for all, as its benefits are large and far-reaching. Illiteracy means a substantially higher likelihood of ending up in jail or on welfare, it is linked to more discrimination and more preventable diseases, and for every dollar spent on combating adult illiteracy the return on investment (ROI) is $6.14 (614%). Another extremely important effect of education is the social network you get, which combats loneliness, which in itself has several negative health impacts [1].
Given that there is no doubt about the importance of education, it’s important to investigate when the educational system fails and people drop out, and which factors have an impact on dropping out. To investigate this we’ll look at poverty data from New York City in 2015. We focus on education, since this is one of the 17 Sustainable Development Goals (SDGs), and education is important for developing our world in a sustainable direction.

If you want to look behind the scenes, we have our explainer notebook here.

An introduction to the data

You can download the data from data.cityofnewyork.
It contains 69103 participants and 61 columns, of which we only use the subset of 11 columns listed below:

  • Age of person

  • Borough (Bronx, Brooklyn, Manhattan, Queens, Staten Island)

  • Disability

  • Level of education (no high school, high school, some college, bachelor degree or above)

  • Ethnicity (white, black, Asian, Hispanic, other)

  • Language other than English spoken at home (yes/no)

  • Sex

  • Total income

    • Interest, dividends, and net rental income

    • Self-employment income

    • Wages or salary income

    • Retirement income

We mainly use people’s educational status, since this is what we are interested in investigating. Additionally, we use the other features listed above to see which of them have an influence on education.

Furthermore, we only look at adults (people older than 24), as younger people have not had a fair opportunity to finish a bachelor’s degree. Hence we remove all rows with younger people, which leaves us with a little less than 50000 participants.

The data is generated annually by a research unit in the Mayor’s office. It is derived from the American Community Survey Public Use Microdata Sample for NYC. We will be using the data generated for 2015 as the foundation for this article.

A look at education in NYC with regard to SDG 4.1

The main subject at play is which level of education people have attained. In [sdg 4.1] we see that the goal is for all to have finished a secondary education, which corresponds to high school in our case.

EducAttain                    Percentage
Less than High School         18.0
High School Degree            24.0
Some College                  20.0
Bachelors Degree or higher    38.0

In our dataset, 18% do not achieve this goal. We will further investigate what impact this has.

How education impacts salary

An obvious attribute that we would expect education to have an impact on is income. Intuitively, more education leads to better and more well-paid jobs, so let’s investigate this claim. We’ll do this by looking at the distribution of total income in the different education groups.

_images/Report1_10_0.png

Interestingly, the individual with the second-highest income (see the max line in green) has no high school diploma, i.e. this person earns more than everyone with a high school degree or some college. This only shows that a high salary/income can be obtained without any formal education, not that you should expect a lower maximum income if you have some college education. In fact, the figure confirms that you can expect a higher salary the higher your education level. This can be seen in both the average (red line) and the median (purple line). Thus it’s easy to conclude that education is an effective tool against poverty. However, it’s important to note that we know nothing of the jobs that people occupy, so a higher salary does not necessarily mean a job that is a “vehicle for sustainable development” [UN].

Sex: a hopeful story

SDG 4 says: “Ensure inclusive and equitable quality education and promote lifelong learning opportunities for all” [2], which unsurprisingly also includes women. Worldwide we know there is a discrepancy between males and females, as illustrated by Hans Rosling’s quiz in the opening of his famous book Factfulness: “Worldwide, 30-year-old men have spent 10 years in school, on average. How many years have women of the same age spent in school?” The answer is 9 years [3]. This number is of course not specific to the US or NYC, so let’s take a look at the distribution of sexes across the different levels of education in NYC:

_images/Report1_13_0.png

It’s fairly clear that there is no real difference in education obtained between the sexes, so NYC is doing pretty well regarding gender equality in the educational system.

Salary and sex: a sad story

Although the equality of education between the sexes is a good sign, it’s an entirely different and alarming story when looking at sex and salary:

_images/Report1_16_0.png

Here we have a fairly big discrepancy, as both the average and the median are significantly higher for males (about 70%). This is very alarming, as it seems to contradict our previous conclusion that higher education means a higher salary and is thus an effective tool against poverty. The two figures above suggest that although females have as much education as males, they still have a lower average salary, and thus a higher likelihood of being in poverty. Now you might think that this lower salary could be explained by stay-at-home moms, but there is a difference in salary between men and women even when removing people who do not earn any money:

_images/Report1_18_0.png

The difference is still significant (about 35%). And even if the entire difference could be explained by stay-at-home moms, it is still questionable whether this should be the case, as it is gender inequality whether it is voluntary or not.
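To make the two percentages above concrete, here is a minimal sketch of how such a gap could be computed, once for everyone and once for earners only. The incomes are made up for illustration and the helper names (`relative_gap`, `without_zero_income`) are our own assumptions, not the code from the explainer notebook.

```python
from statistics import mean, median


def relative_gap(higher, lower, stat=mean):
    """Return how much larger stat(higher) is than stat(lower), as a fraction."""
    return stat(higher) / stat(lower) - 1


def without_zero_income(incomes):
    """Drop people who report no income at all."""
    return [x for x in incomes if x > 0]


# Made-up incomes, for illustration only (not taken from the NYC data).
male = [0, 30_000, 50_000, 70_000, 100_000]
female = [0, 0, 30_000, 40_000, 60_000]

gap_all = relative_gap(male, female)
gap_earners = relative_gap(without_zero_income(male), without_zero_income(female))
median_gap_all = relative_gap(male, female, stat=median)

print(f"mean gap, everyone:     {gap_all:.0%}")
print(f"mean gap, earners only: {gap_earners:.0%}")
print(f"median gap, everyone:   {median_gap_all:.0%}")
```

In this toy example the gap shrinks once people without income are removed but does not disappear, which is the same pattern described in the text.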

The blooming of educated non-white ethnicities

NYC is a multicultural city with people coming from all ethnicities. Do all of them get the same education opportunities?

_images/Report1_21_0.png

We see that for white, Asian, and other, over 50% have a higher education, whereas the majority of the Hispanics in our dataset have less than a high school education, which is below the target of SDG 4.1. But could this be a historical issue that is no longer the case? We would expect the general level of education to be higher the lower the age (for people older than 24), since the focus on education and the resources helping people to get an education have changed dramatically. Additionally, American society is generally less segregated than it was in, say, the 60s, so we would expect to see a greater increase in education for all other races.

_images/Report1_23_0.png

Hispanics seem to have the most trouble obtaining an education, even in the younger generation of 20-30, where there are still some age groups in which high school is the most frequent level of education. This may be troublesome for many, given the steadily increasing importance of a college degree [4]. But in line with our expectation, the younger generation has a higher level of education, which is a result of the increased focus on education.

The vicious circle in boroughs

New York City is divided into five different boroughs, each with its own flavor [5]:

  • The Bronx: one of the most prominent centers of urban poverty in the United States

  • Brooklyn: a collision of old and new

  • Manhattan: the center and public face of NYC, with Central Park, Broadway shows, and Times Square

  • Queens: primarily middle-class families and the most ethnically varied of all the boroughs

  • Staten Island: the most rural part of the city

Thus it seems fairly intuitive that these boroughs also represent different demographics of the NYC population. To investigate how different the demographics in the different boroughs are, let’s start by looking at the average income.

Clearly, there is a big difference between living in Manhattan and the Bronx: the average income in Manhattan is over three times that of the Bronx, even though they are two neighboring boroughs. Brooklyn, Staten Island, and Queens are much more similar to each other but still far behind Manhattan. However, Manhattan is, as mentioned, home to the biggest and most shining stars of capitalism like Wall Street, Trump Tower, the Empire State Building, etc., which not only means insanely high property values (3 times the average [6] of NYC [7]), but also a concentration of wealth.

As we’ve mentioned previously, there is a correlation between a high income and a high level of education. Thus the heatmap above would seem to indicate that a lot of highly educated people are concentrated in Manhattan, with the Bronx lagging severely behind. Let’s check this with heatmaps of the percentage of people with a certain level of education by borough.

Sadly this is exactly what we see. Clearly, wealth and education are centered in Manhattan, with about two-thirds having a bachelor’s or higher, while it’s only about a third in Brooklyn, Staten Island, and Queens. What’s worse, in the Bronx people with less than high school make up the largest share of the education categories. This becomes a big problem because of the effect that parents’ level of education has on their children: children of parents with a lower level of education have a much lower likelihood of obtaining a higher level of education [8]. This means that boroughs are effectively a positive feedback loop, where well-educated parents produce well-educated children, who earn a higher income, making Manhattan even more expensive, and so on. Thus boroughs can fuel the discrepancy in NYC.

Finally, let’s see how the different ethnicities are situated in NYC. Below we plot what percentage of an ethnicity is situated in a given borough.

Here another bleak picture is painted. Manhattan is predominantly white, meaning wealth and education are still concentrated among white people, while the Bronx is mostly not white, so white people are also mostly not the poor. Hispanics are mostly in the Bronx, which corresponds to the fact that Hispanics are the least educated ethnicity. Ending on a high note, we do see that black people are well spread out across NYC, but Hispanics account for 80% of the population of the Bronx.

Creating a model that should be bad at predicting (but isn’t)

Imagine you’re a school principal and would like to find out which aspiring new kids will attain some level of college education later in life. You do this so you don’t have to waste any time and resources on people you believe are ultimately undeserving. To do this you gather all the information you can about your new students, like borough (location), disability, languages other than English spoken, ethnicity, and sex.

Ultimately this is a list of attributes that really should not have any influence on the level of education an individual will obtain, which is why the ML model used to predict this is hopefully bad.
Interestingly, we would expect languages other than English spoken to have a negative impact on the level of education attained: the majority of Americans only speak English, so speaking another language is an indication that the person is of another race. Of course, attributes like ethnicity and sex having an influence on education would go directly against SDGs 4 and 5.

Finally, as we saw in the heat maps there is a high concentration of Hispanics and less than high school education in the Bronx, thus it is logical to conclude that Borough would have an impact on education.

What is a classification model?

Predicting whether an individual will attain at least some college education or not (0 = less than high school and high school, 1 = some college and bachelor’s degree or higher) is called classification, and our classification model consists of decision trees.
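The 0/1 encoding described above can be sketched as a small mapping. The label strings are the ones shown in the education table earlier in the article; the function name `to_binary_target` is our own, hypothetical choice and not necessarily what the notebook uses.

```python
# Labels as they appear in the education table earlier in the article.
HIGHER_EDUCATION = {"Some College", "Bachelors Degree or higher"}
NO_HIGHER_EDUCATION = {"Less than High School", "High School Degree"}


def to_binary_target(education_level):
    """Map an education level to 1 (at least some college) or 0 (otherwise)."""
    if education_level in HIGHER_EDUCATION:
        return 1
    if education_level in NO_HIGHER_EDUCATION:
        return 0
    raise ValueError(f"Unknown education level: {education_level!r}")


levels = [
    "High School Degree",
    "Some College",
    "Less than High School",
    "Bachelors Degree or higher",
]
targets = [to_binary_target(level) for level in levels]
print(targets)
```

Raising an error on unknown labels is a design choice made here so that misspelled or missing categories are noticed instead of silently counted as 0.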

A decision tree works by asking “yes/no” questions, for instance: “Is this person a male?”. This creates a split in the tree. Based on the answer to the question a new question is asked on each branch, creating new splits. Multiple splits are created in this way, such that for each split we get more and more information about our data. We select the questions such that the two resulting subgroups from the split are as different as possible, and the data points within each subgroup are as similar as possible. To calculate this we can use a measure called entropy. Finally, we have split the data into multiple subgroups, where for each subgroup there will be a higher probability of predicting the right target class, than if we had not asked any questions.
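As a rough illustration of how entropy helps pick a question, the sketch below computes the entropy of a group and the information gain of a yes/no split. The toy labels and answers are made up, and the function names are our own; a real decision tree library does this internally and may use a different impurity measure.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(labels, answers):
    """Entropy reduction obtained by splitting `labels` on yes/no `answers`."""
    yes = [y for y, a in zip(labels, answers) if a]
    no = [y for y, a in zip(labels, answers) if not a]
    weighted = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(labels)
    return entropy(labels) - weighted


# Toy data (made up): 1 = some college or more, 0 = high school or less.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
question_a = [True, True, True, False, False, False, True, False]
question_b = [True, False, True, False, True, False, True, False]

gain_a = information_gain(labels, question_a)
gain_b = information_gain(labels, question_b)
print(f"gain of question A: {gain_a:.3f}")
print(f"gain of question B: {gain_b:.3f}")
```

Question A separates the two classes perfectly, so it has the larger gain and would be chosen for the split.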

We then create multiple of these decision trees, all different and uncorrelated. To do a prediction we can then combine these decision trees and predict what most of the trees are predicting. This is what is called a random forest. The random forest minimizes errors in the classification because we get inputs from multiple decision trees, hence one wrong tree prediction will not make a difference as long as most trees predict correctly.
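The "predict what most of the trees are predicting" step can be sketched as a simple majority vote. The per-tree predictions below are hypothetical; an actual random forest implementation may combine its trees differently, for example by averaging probabilities.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by most trees for one person."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical predictions from five trees for three people
# (1 = higher education, 0 = no higher education).
votes_per_person = [
    [1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 0, 1, 0, 1],
]
forest_predictions = [majority_vote(votes) for votes in votes_per_person]
print(forest_predictions)
```

Using an odd number of trees, as here, avoids ties in a two-class problem.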

Finally, we are also performing a randomized search to select the best random forest model. The randomized search simply creates multiple random forest models with different parameters such as “number of trees in each forest”, “maximum number of levels in each tree” etc. It then runs and evaluates all models, selecting the random forest which performs best.
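The idea of a randomized search can be sketched in a few lines. This is only an illustration under stated assumptions: the parameter names and values are invented, and `toy_score` is a stand-in for "train a random forest with these parameters and measure how well it performs", which is what an actual search would do.

```python
import random


def randomized_search(param_space, evaluate, n_iter=10, seed=0):
    """Try `n_iter` random parameter combinations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space (not the values used in the notebook).
param_space = {
    "n_trees": [50, 100, 200, 400],
    "max_depth": [3, 5, 10, 20],
}


def toy_score(params):
    # Stand-in for training a forest and measuring validation performance.
    return -abs(params["n_trees"] - 200) / 1000 - abs(params["max_depth"] - 10) / 100


best_params, best_score = randomized_search(param_space, toy_score, n_iter=20)
print(best_params, best_score)
```

Compared with trying every combination, a randomized search only evaluates a fixed number of samples, which keeps the cost predictable when the search space is large.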

The depressingly good performance of our model

The binary classification model we have created “unfortunately” performs rather well. This indicates that external factors such as race, disability, location, etc. have an impact on achieving higher education in New York City. It means that, based on a person’s rather neutral features, we can predict whether or not this person has a higher education (or will get one). These are features that we believe should not have an influence on whether or not a person receives higher education. We can compare our model to a baseline model that predicts everyone to have higher education, whereas our model makes a prediction for each person based on the previously mentioned features.

We get the following statistical performance measurements for the baseline and random forest model respectively:

Baseline model

Performance Measurement    Performance
Accuracy                   0.59
Precision                  0.59
Recall                     1.00
F1 score                   0.74

Random Forest Classifier

Performance Measurement    Performance
Accuracy                   0.65
Precision                  0.67
Recall                     0.79
F1 score                   0.73

These measures all range from zero to one, where zero is the worst and one is the best. We see that the random forest model has the best accuracy and precision, whereas the baseline model has a better recall and F1 score. Accuracy is the number of correct predictions divided by the total number of predictions. Hence our random forest model overall predicts correctly more often than the baseline model. Precision and recall are two measures often seen together. One way to explain them is that precision is a measure of quality and recall is a measure of quantity [precision, recall wiki]: precision tells you how many of the higher education predictions are correct, and recall tells you how many of the actual higher education instances were correctly predicted (the true positives (TP)). So naturally, when predicting every instance to be higher education, the recall will be 1.
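The baseline numbers can be reproduced with a short sketch. We assume 100 made-up people of whom 59 have a higher education, which is the share implied by the baseline accuracy of 0.59; the function name is our own. The F1 score is computed as the harmonic mean of precision and recall.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = higher education)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Assumed toy population: 59 of 100 people have a higher education.
y_true = [1] * 59 + [0] * 41
baseline_pred = [1] * 100  # the baseline predicts higher education for everyone

baseline = classification_metrics(y_true, baseline_pred)
print({name: round(value, 2) for name, value in baseline.items()})
```

Because the baseline never predicts "no higher education", it has no false negatives, which is why its recall is exactly 1 while its precision equals the share of people with a higher education.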

Finally, the F1 score is the harmonic mean of the two measurements.
Hence, to evaluate the model we need to look at what the model should be used for. If we assume the rather uncomfortable thought that the NYC government would predict who gets an education in order to know which people to spend resources on, then their goal is probably to invest only in the people who get a higher education, and not to risk investing in someone who does not. Here they would prefer a higher precision over a higher recall, because this indicates that they seldom invest in the wrong people; they would rather invest a little less, even though they will miss some potentially good investments. Hence in this case our model would actually be useful.

The above-stated thought is morally uncomfortable, but it is nonetheless the sad truth that, based on the data on NYC in 2015, the model reveals segregation, and people’s neutral features can be used to predict education.

Talking about segregation we can look into our model to see if it is biased, and in which areas it is most biased. In our case, we know that the model is biased since we only included features that should not have an influence. But we can take a look at which features have the most influence, and hence where the model is most biased. Therefore we have plotted the importance of the different features. This is simply a measure of how important each feature is for the prediction.

_images/Report2_6_0.png

We see that Non-Hispanic White is the most important feature, Manhattan is the second most important, and Hispanic, any Race is the third most important. Hence the model sees a person’s race as an important attribute for predicting the education level attained. This does not seem fair.
Luckily, sex does not seem that important, which is exactly what we saw in the bar plot of gender in each education level.

Next, we have plotted the normalized confusion matrix of the model’s overall performance. It shows the ratio between the actual educational status and the predicted educational status. These numbers are used to calculate the statistics we used to evaluate the model earlier. From the matrix, we see that our model is better at predicting the people who do get an education than the people who don’t. Hence the model leans towards predicting that people will get a higher education.

Finally, we go back and look if our concern about the model being very biased toward race holds. To investigate this we have plotted the difference in the performance of the model for three ethnicities: Non-Hispanic White (white), Non-Hispanic Black (black), Hispanic Any Race (Hispanic).

On the y-axis we have the different labels TN (true negative: correctly predicted no higher education), FP (false positive: predicted higher education, but had no higher education), FN (false negative: predicted no higher education, but had higher education), and TP (true positive: correctly predicted higher education). The x-axis shows the difference between the chosen race and the overall model, which is not split by race.
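A comparison of this kind could be computed roughly as follows. This is a sketch with made-up data and generic group names "A" and "B"; we assume that each outcome is expressed as a share of all predictions within the group, which may differ from the exact normalization used for the plot.

```python
from collections import Counter

OUTCOME_NAMES = {(0, 0): "TN", (0, 1): "FP", (1, 0): "FN", (1, 1): "TP"}


def outcome_shares(y_true, y_pred):
    """Share of TN, FP, FN and TP among all predictions."""
    counts = Counter(OUTCOME_NAMES[(t, p)] for t, p in zip(y_true, y_pred))
    total = len(y_true)
    return {name: counts[name] / total for name in ("TN", "FP", "FN", "TP")}


def difference_from_overall(y_true, y_pred, groups, group):
    """Outcome shares of one group minus the shares of the whole population."""
    overall = outcome_shares(y_true, y_pred)
    idx = [i for i, g in enumerate(groups) if g == group]
    subset = outcome_shares([y_true[i] for i in idx], [y_pred[i] for i in idx])
    return {name: subset[name] - overall[name] for name in overall}


# Made-up labels and predictions for two groups of four people each.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

diff_a = difference_from_overall(y_true, y_pred, groups, "A")
diff_b = difference_from_overall(y_true, y_pred, groups, "B")
print(diff_a)
print(diff_b)
```

A positive value means the outcome is more common in the group than in the population as a whole, and a negative value means it is less common.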

From the plot, we see that for the white people the model has much fewer true negatives and false negatives, and a lot more false positives and true positives. This means the model very seldom predicts a white person not to have an education and most frequently predicts white to have a higher education. For Hispanic and black people we see an opposite trend. Meaning the model tends towards predicting black and Hispanic people don’t have an education. Hence we were right in our concern about the bias in the model. This bias is simply due to the data being biased, indicating racial segregation in the city of New York. This is an unfortunate fact we cannot change.

Making the model fair

If we want to use the model to predict education without it being unfair to some races, we can debias the model with respect to race. We have chosen white and Hispanic, since these are the most influential, and included black as well, since there is a lot of history regarding black people’s educational rights (and of course rights in general). The idea behind debiasing is to make the model equally fair for all races. The way we wish to make the model fair is by getting a similar true positive rate $(TPR= TP/(TP+FN))$ and false positive rate $(FPR=FP/(FP+TN))$ for the races. Ideally, we want a high true positive rate and a low false positive rate.
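The two rates defined above can be computed per group as in the sketch below. The data and the group names "A" and "B" are made up for illustration, and the helper names are our own assumptions.

```python
def rates(y_true, y_pred):
    """Return (TPR, FPR) for binary labels where 1 = higher education."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


def rates_by_group(y_true, y_pred, groups):
    """Compute (TPR, FPR) separately for every group label."""
    result = {}
    for group in sorted(set(groups)):
        idx = [i for i, g in enumerate(groups) if g == group]
        result[group] = rates([y_true[i] for i in idx], [y_pred[i] for i in idx])
    return result


# Made-up labels and predictions for two groups of four people each.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

group_rates = rates_by_group(y_true, y_pred, groups)
print(group_rates)
```

In this toy example group A has both a higher TPR and a higher FPR than group B, which is the kind of imbalance that debiasing tries to reduce.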

For our current overall model, we have a TPR (true positive rate) of 0.79 and an FPR (false positive rate) of 0.54. For white people both of these measurements are higher, and for Hispanics both are much lower. We cannot change the model, but we can change how the model predicts. The model gives each person a probability of having a higher education. Normally, if the probability is above 0.5 (50%), it predicts the person to have a higher education; if it is below, it predicts the person not to have a higher education. The probability of 0.5 is called the threshold. To debias the model we can change the threshold based on which race we are looking at. We will find the thresholds that give us the best and most similar TPRs and FPRs across the races.

To do this we’ve calculated TPRs and FPRs for each race at a range of different thresholds. We then plot the FPR on the x-axis and the TPR on the y-axis, where each point corresponds to a threshold. This is called an ROC (receiver operating characteristic) curve.
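The points of such a curve can be computed by sweeping the threshold over the predicted probabilities, as in this minimal sketch. The labels, probabilities, and thresholds are made up; in practice this would be done once per race so that the curves can be compared.

```python
def roc_points(y_true, scores, thresholds):
    """Return a (threshold, FPR, TPR) tuple for every threshold."""
    positives = sum(1 for t in y_true if t == 1)
    negatives = len(y_true) - positives
    points = []
    for threshold in thresholds:
        # Predict higher education when the probability is above the threshold.
        y_pred = [1 if s > threshold else 0 for s in scores]
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        tpr = tp / positives if positives else 0.0
        fpr = fp / negatives if negatives else 0.0
        points.append((threshold, fpr, tpr))
    return points


# Made-up labels and predicted probabilities of higher education.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
thresholds = [0.25, 0.5, 0.75]

points = roc_points(y_true, scores, thresholds)
for threshold, fpr, tpr in points:
    print(f"threshold={threshold:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Raising the threshold lowers both rates, so every choice of threshold is a trade-off between missing people with a higher education and wrongly predicting it.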


Evaluating the ROC curve visually, we want to find a high TPR, a low FPR, and three thresholds (points) that are close to each other, making the rates for the races similar. We have selected the following three thresholds:

Black: 0.53
White: 0.68
Hispanic: 0.42

This means that for black the probability needs to be above 0.53 to predict higher education, whereas for white it needs to be above 0.68, and for Hispanics it only needs to be above 0.42.
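Applying the thresholds listed above could look like the sketch below. The three threshold values are the ones from the text; the dictionary keys, the function name, and the fallback of 0.5 for any other group are our own assumptions.

```python
# Thresholds quoted in the text; the group keys are our own shorthand.
THRESHOLDS = {"black": 0.53, "white": 0.68, "hispanic": 0.42}
DEFAULT_THRESHOLD = 0.5  # assumption: groups without their own threshold keep 0.5


def predict_with_group_threshold(probability, group):
    """Predict 1 (higher education) if the probability exceeds the group's threshold."""
    threshold = THRESHOLDS.get(group, DEFAULT_THRESHOLD)
    return 1 if probability > threshold else 0


# The same model probability leads to different predictions per group.
probability = 0.60
predictions = {
    group: predict_with_group_threshold(probability, group) for group in THRESHOLDS
}
print(predictions)
```

With a probability of 0.60, the prediction is higher education for the black and Hispanic groups but not for the white group, which shows how the thresholds shift the decision without retraining the model.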

The specific values of the thresholds that debias our model reveal how segregated the population is. We see how much we need to change the thresholds relative to each other in order to achieve a fair model. Specifically, the threshold for Hispanics is almost two-thirds of that for whites.

We can now take a look at the effect of our debiasing. To do this we have plotted the TPR and FPR for each race, before and after the debiasing, to see how we have minimized the difference between the races.

_images/Report2_11_0.png

Hence, after debiasing, our chosen thresholds result in the following rates:

Race        TPR     FPR
Black       0.59    0.45
White       0.65    0.34
Hispanic    0.66    0.49

So now we have a fairly fair model regarding the ethnicities (the rates are similar for the three races), and it performs rather okay. The TPR is higher than the FPR for every race, hence the model predicts higher education more often for people who have it than for people who do not.

The path to a better future

Hopefully, the reader is convinced of the importance of education. Not only is it the standpoint of the UN that all countries should strive to ensure a high level of education in their population, as it is the best tool to achieve sustainable development, but we also saw a high positive correlation between income and education, which helps against poverty.

We saw that females have on average the same level of education as males; however, this did not result in females having the same average salary. In fact, males had an average salary about 70% higher (about 35% higher when excluding people without any income). This worrying discrepancy is something we saw throughout the data: Hispanics were much less educated than whites, with blacks being somewhere in the middle between the two.

This difference between the races had a big influence on the machine learning model we created. Being white meant that the model much more often thought an individual would have some level of college, whereas the opposite was true for Hispanics, and again black was somewhere in the middle. Interestingly, we saw that the borough of Manhattan also had a positive influence in predicting education. Thus we have two external factors (race and borough) that heavily influence the level of education an individual will attain. This is a problem: obviously, you would prefer these factors not to have any influence, and certainly not to the extent that they currently have.

If you still want to use the model, we showed how to make the model’s predictions equally fair for the ethnicities. We did this by changing how certain the model has to be before predicting higher education. Effectively, this means the model treats white people worse than it otherwise would and Hispanics better. It does this, however, to make up for the racial inequalities that exist in New York City.